During the second week of each unit, we’ll “walk through” a basic research workflow, or data analysis process, modeled after the Data-Intensive Research Workflow from Learning Analytics Goes to School (Krumm et al., 2018):
Figure 2.2 Steps of Data-Intensive Research Workflow
Each walkthrough will focus on a basic analysis guided by the social network perspective.
This week, our focus will be on preparing relational data for analysis, looking at some basic network stats, and creating a network visualization that helps illustrate key findings. Specifically, the Unit 1 Walkthrough will cover the following workflow topics:
Prepare: Prior to analysis, we’ll take a look at the context from which our data came, formulate some research questions, and get introduced to the {igraph} R package for network analysis.
Wrangle: Wrangling data entails the work of manipulating, cleaning, transforming, and merging data. In section 2, we focus on importing network data, converting our familiar data frames into a network object that can be analyzed and graphed, and learning about “simple graphs.”
Explore: In section 3, we calculate some basic network descriptives and learn how to summarize the descriptives through a network visualization.
Model: While we won’t dig into approaches for modeling network data until Unit 3, we will take a quick look at some approaches used in the study guiding this walkthrough.
Communicate: We’ll learn more about communicating key findings next week, but for now will learn some basic components of a data product.
Prior to analysis, it’s critical to understand the context and data sources available so you can formulate useful questions that can be feasibly addressed by your data. For this section, we’ll focus on the following topics:
In Social Network Analysis and Education: Theory, Methods & Applications, Carolan (2013) notes that:
the social network perspective is one concerned with the structure of relations and the implication this structure has on individual or group behavior and attitudes
More specifically, Carolan cites the following four features used by Freeman (2004) to define the social network perspective:
Social network analysis is motivated by a relational intuition based on ties connecting social actors.
It is firmly grounded in systematic empirical data.
It makes use of graphic imagery to represent actors and their relations with one another.
It relies on mathematical and/or computational models to succinctly represent the complexity of social life.
For Unit 1, our walkthrough will be guided by previous research and evaluation work conducted by the Friday Institute for Educational Innovation as part of the Massively Open Online Courses for Educators (MOOC-Ed) initiative.
Take a quick look at the Description of the Dataset section from the Massively Open Online Course for Educators (MOOC-Ed) network dataset BJET article and the accompanying data sets stored on Harvard Dataverse that we’ll be using for this walkthrough.
In the space below, type a brief response to the following questions:
What were some of the steps necessary to construct this dataset?
What two “node attributes” from the dataset might be useful for predicting participants who may be more engaged or central to the network? Why did you select those two?
What else do you notice/wonder about this dataset?
A Social Network Perspective on Peer Supported Learning in MOOC-Eds was framed by three primary research questions related to peer supported learning:
What are the patterns of peer interaction and the structure of peer networks that emerge over the course of a MOOC-Ed?
To what extent do participant and network attributes (e.g., homophily, reciprocity, transitivity) account for the structure of these networks?
To what extent do these networks result in the co-construction of new knowledge?
For our very first walkthrough, we are going to focus exclusively on RQ1 from the original study and our question of interest about our educator network is:
To what extent did educators engage with other participants in the discussion forums?
Based on what you know about networks and the context so far, what subquestions or more specific research questions might we ask that a social network perspective might be able to answer?
In the space below, type a brief response to the following questions:
-
We’ll revisit your response towards the end and provide an opportunity to refine your research question after you know the data a little better.
As highlighted in Chapter 6 of Data Science in Education Using R (DSIEUR):
Packages are shareable collections of R code that can contain functions, data, and/or documentation. Packages increase the functionality of R by providing access to additional functions to suit a variety of needs.
Let’s check to see which packages have already been loaded into our RStudio Cloud workspace. Take a look at the Files, Plots, & Packages Pane in the lower right-hand corner of RStudio Cloud to make sure these packages have been installed and loaded:
You should see some familiar tidyverse packages from our Getting Started Walkthrough like {dplyr} and {readr}, which we’ll be using again shortly. You should also see an important package called {igraph} that we’ll rely on heavily for our network analyses.
If you are working in RStudio Desktop, or notice that the packages have not been installed and/or loaded, run the following install.packages() function code to install the {tidyverse} and {igraph} packages:
install.packages("tidyverse")
install.packages("igraph")
Let’s go ahead and use the library() function to load the {tidyverse} package and review the other packages from the tidyverse collection that it contains:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.2 ✓ dplyr 1.0.6
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
For our Unit 1 Walkthrough, we will rely heavily on the igraph network analysis package. The main goals of the igraph package and the collection of network analysis tools it contains are to provide a set of data types and functions for:

pain-free implementation of graph algorithms;

fast handling of large graphs, with millions of vertices and edges; and

rapid prototyping via high-level languages like R.
Run the code chunk below to load the {igraph} package:
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:purrr':
##
## compose, simplify
## The following object is masked from 'package:tidyr':
##
## crossing
## The following object is masked from 'package:tibble':
##
## as_data_frame
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
Take a look at the messages in the output after loading the igraph library. Which tidyverse packages share identically named functions with igraph?
Write your response in the space below.
-
In general, data wrangling involves some combination of cleaning, reshaping, transforming, and merging data (Wickham & Grolemund, 2017). The importance of data wrangling is difficult to overstate, as it involves the initial steps of going from raw data to a dataset that can be explored and modeled (Krumm et al., 2018).
For our data wrangling this week, we’re keeping it simple, since working with network data is a bit of a departure from working with rectangular data frames. Our primary goals for Unit 1 are learning how to:
Import Data. Before working with data, we need to “read” it into R, and once imported, we’ll take a look at different ways to view our data in R.

Create a Network Object. Next, we’ll convert our imported data frames into an igraph network object that can be summarized and visualized.

Simplify Network. Finally, we’ll learn about a handy simplify() function in the {igraph} package for removing duplicate ties and self-loops.
To get started, we need to import, or “read”, our data into R. The function used to import your data will depend on the file format of the data you are trying to import, but R is pretty adept at working with many file types.
Take a look in the /data folder in your Files pane. You should see the following .csv files:
dlt1-edgelist.csv
dlt1-nodes.csv
As its name implies, the first file, dlt1-edgelist.csv, is an edge-list that contains information about each tie, or relation between two actors in a network. In this context, a “tie” is a reply by one participant in the discussion forum to the post of another participant, or in some cases to their own post! These ties from an actor to themselves are called “self-loops” and, as we’ll see later in this section, igraph has a special function to remove these self-loops from a sociogram, or network visualization.

The edge-list format is slightly different from other formats you have likely worked with before in that the values in the first two columns of each row represent a dyad, or tie between two nodes in a network. An edge-list can also contain other information regarding the strength, duration, or frequency of the relationship, sometimes called “weight”, in addition to other “edge attributes.”
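To make the format concrete, here is a toy edge-list with made-up participant IDs (purely illustrative, not from the DLT 1 data) that includes a duplicate tie and a self-loop:

```r
library(tibble)

# A made-up edge-list: each row is one directed tie (Sender replied to Receiver)
toy_edgelist <- tribble(
  ~Sender, ~Receiver,
  "A",     "B",        # A replied to B...
  "A",     "B",        # ...and replied to B a second time (a duplicate tie)
  "B",     "C",
  "C",     "C"         # C replied to their own post (a self-loop)
)

toy_edgelist
```

Rows 1 and 2 form a repeated dyad, and row 4 is a self-loop: both are exactly the kinds of ties the simplify() function removes by default.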
Specifically, our edge-list contains the following variables:
Sender = Unique identifier of author of comment
Receiver = Unique identifier of identified recipient of comment
Timestamp = Time comment was posted
Parent = Primary category or topic of thread
Category = Subcategory or subtopic of thread
Thread_id = Unique identifier of a thread
Comment_id = Unique identifier of a comment
Let’s use the read_csv() function from the {readr} package introduced in the Getting Started walkthrough to read in our edge-list and print the new ties data frame:
ties <- read_csv("data/dlt1-edgelist.csv",
col_types = cols(Sender = col_character(),
Receiver = col_character(),
`Category Text` = col_skip(),
`Comment ID` = col_character(),
`Discussion ID` = col_character()))
ties
Note the addition of the col_types = argument for changing the column types to character strings, since the numbers in those particular columns identify actors (Sender and Receiver) and attributes (Comment ID and Discussion ID). We also skipped the Category Text column.
RStudio Tip: Importing data and dealing with data types can be a bit tricky, especially for beginners. Fortunately, RStudio has an “Import Dataset” feature in the Environment Pane that can help you use the {readr} package and associated functions to greatly facilitate this process.
Consider the example pictured below of a discussion thread from the Planning for the Digital Learning Transition in K-12 Schools (DLT 1) course where our data originated. This thread was initiated by participant I, so the comments by J and N are considered to be directed at I. The comment by B, however, is a direct response to the comment by N, as signaled by the use of the quote feature as well as the explicit mention of N’s name within B’s comment.
Now answer the following questions as they relate to the DLT 1 edge-list we just read into R.
Which actors in this thread are the Sender and the Receiver? Which actor is both?
How many dyads are in this thread? Which pairs of actors are dyads?
Sidebar: Unfortunately, nuances in discussion forum data like those illustrated by this simple example are rarely captured through automated approaches to constructing networks. Fortunately, the dataset you are working with was carefully reviewed to more accurately capture the intended recipients of each reply.
The second file we’ll be using to help understand our network and the actors involved contains all the nodes or actors (i.e., participants who posted to the discussion forum) as well as some of their attributes, such as gender and years of experience in education.
Carolan (2013) notes that most social network analyses include variables that describe attributes of actors, ones that are either categorical (e.g., sex, race, etc.) or continuous in nature (e.g., test scores, number of times absent, etc.). These attributes can be incorporated into a network graph or model, making it much more informative, and can aid in testing or generating hypotheses.
These attribute variables are typically included in a rectangular array, or data frame, that mimics the actor-by-attribute structure that is the dominant convention in social science: rows represent cases, columns represent variables, and cells consist of values on those variables.
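As a quick sketch of this convention (again with made-up values, not our actual node file), an actor-by-attribute data frame might look like this:

```r
library(tibble)

# Rows are actors (cases), columns are attributes (variables),
# mixing a categorical attribute (gender) with a continuous one (experience)
toy_actors <- tribble(
  ~UID, ~gender,  ~experience,
  "A",  "male",   5,
  "B",  "female", 12,
  "C",  "female", 3
)

toy_actors
```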
As an aside, Carolan also refers to this historical preference by researchers for “actor-by-attribute” data, in which the actor has been removed from their social context in the absence of relational data, as the “sociological meatgrinder” in action. Specifically, this historical approach assumes that the actor does not interact with anyone else in the study and that outcomes are solely dependent on the characteristics of the individual.
Regardless, let’s read in our node attribute file and take a look at the actors and their attributes included in our dataset:
actors <- read_csv("data/dlt1-nodes.csv",
col_types = cols(UID = col_character(),
Facilitator = col_character(),
expert = col_character(),
connect = col_character()))
Use the code chunk below and a function of your choosing to take a look at the actors data frame:
Match up the attributes included in the node file with the following codebook descriptors. The first one has been done as an example.
Facilitator = Identification of course facilitator (1 = instructor)

Before we can begin using many of the functions from the {igraph} package for summarizing and visualizing our DLT 1 network, we first need to convert the data frames that we imported into an igraph network object, or an igraph graph.
To do that, we will use the graph_from_data_frame() function. Run the following code to take a look at the help documentation for this function:
?graph_from_data_frame
You probably saw that this particular function takes the following three arguments, two of which are data frames:
d describes the edges of the network. The first two columns are the IDs of the source and the target node for each edge, in our case the Sender and Receiver of a discussion post. The order matters! The following columns are edge attributes such as weight, type, label, or anything else.
vertices starts with a column of node IDs and any following columns are interpreted as node attributes.
directed determines whether or not to create a directed graph.
Run the following code to specify our ties data frame as the edges of our network, our actors data frame for the vertices of our network and their attributes, and indicate that this is indeed a directed network.
network <- graph_from_data_frame(d = ties,
vertices = actors,
directed = T)
network
## IGRAPH 368cf94 DN-- 445 2529 --
## + attr: name (v/c), Facilitator (v/c), role1 (v/c), experience (v/n),
## | experience2 (v/c), grades (v/c), location (v/c), region (v/c),
## | country (v/c), group (v/c), gender (v/c), expert (v/c), connect
## | (v/c), Timestamp (e/c), Discussion Title (e/c), Discussion Category
## | (e/c), Parent Category (e/c), Discussion Identifier (e/c), Comment ID
## | (e/c), Discussion ID (e/c)
## + edges from 368cf94 (vertex names):
## [1] 360->444 356->444 356->444 344->444 392->444 219->444 318->444 4 ->444
## [9] 355->356 355->444 4 ->444 310->444 248->444 150->444 19 ->310 216->19
## [17] 19 ->444 19 ->4 217->310 385->444 217->444 393->444 217->19 256->219
## + ... omitted several edges
Take a look at the very first line of the output which contains some basic information about our network and answer the following questions:
How many actors and ties are in our network? Is this consistent with the number of observations in our data frames?
The “D” and the “N” indicate that this is a Directed network and has the name vertex attribute set. Why are the two spaces that follow these letters blank? Hint: check the help files.
Which vertex attribute did igraph interpret as numeric?
As you saw from the network output, our dataset has 2529 edges or ties, and just a quick scan of the edges in the network shows that edges like 356 -> 444 occur more than once, so we know that participant 356 has replied to participant 444 at least twice.
How many unique edges does our network have though?
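One way to check is with {dplyr} verbs we already know, applied to the ties data frame we imported earlier (this sketch assumes ties is still in your environment):

```r
library(dplyr)

# Number of unique Sender -> Receiver pairs, i.e. unique directed ties
ties %>%
  distinct(Sender, Receiver) %>%
  nrow()

# Number of self-loops, i.e. participants who replied to their own post
ties %>%
  filter(Sender == Receiver) %>%
  nrow()
```

Since simplify() removes both duplicate edges and self-loops by default, these counts are useful for sanity-checking the simplified network.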
Fortunately, the {igraph} package has a simplify() function for collapsing these duplicate edges so they are not represented more than once when we want to visually depict our network with a sociogram.
Let’s use that function to simplify our network and save it as simple_network, a “simple graph” that contains no self-loops or duplicate edges, both of which the simplify() function removes by default:
simple_network <- simplify(network)
simple_network
## IGRAPH 5078917 DN-- 445 1936 --
## + attr: name (v/c), Facilitator (v/c), role1 (v/c), experience (v/n),
## | experience2 (v/c), grades (v/c), location (v/c), region (v/c),
## | country (v/c), group (v/c), gender (v/c), expert (v/c), connect (v/c)
## + edges from 5078917 (vertex names):
## [1] 1->2 1->7 1->22 1->30 1->36 1->41 1->49 1->50 1->68 1->88
## [11] 1->92 1->109 1->112 1->137 1->144 1->154 1->161 1->192 1->195 1->198
## [21] 1->221 1->444 1->445 2->36 2->67 2->104 2->177 2->223 3->2 3->7
## [31] 3->223 3->310 4->5 4->7 4->26 4->29 4->98 4->107 4->193 4->198
## [41] 4->207 4->308 4->444 5->8 5->12 5->21 5->24 5->67 5->107 5->444
## [51] 5->445 6->5 6->7 6->11 6->41 6->42 6->62 6->68 6->100 6->116
## + ... omitted several edges
Take a look at the output for our simple graph now and answer the following questions:
How many unique edges are in the network?
Did we lose any important or potentially useful information by collapsing multiple edges into a single edge?
We noted earlier that edges can also contain attributes such as strength, duration, or frequency, sometimes called “weight.” These weights can help us better understand the relationship itself, but can also aid in visualization and modeling later on.
When we used the simplify() function earlier, it collapsed our duplicate edges but we lost some vital information as a result, namely the frequency of replies among pairs of educators in our discussion forum.
Fortunately, the simplify() function contains an argument that will allow us to count the number of ties between two actors, similar to how we might use the count() function in the {dplyr} package like so:
edge_weights <- count(ties, Sender, Receiver)
edge_weights
In this case, we see that participant 1 replied to participant 144 twice throughout the course.
To add weights to our simplified network, we first need to add a weight variable to the edges in our original network igraph object.
The {igraph} package has a unique syntax for working with attributes of network objects. To add a weight attribute to the E() edges in our network, we’ll use the $ operator to create a new weight variable and the <- assignment operator to assign an initial value of 1 to each edge.
Let’s put that all together and run the code to add a weight of 1 to each edge in our network:
E(network)$weight <- 1
Now let’s take a look at our igraph network object again:
network
## IGRAPH 368cf94 DNW- 445 2529 --
## + attr: name (v/c), Facilitator (v/c), role1 (v/c), experience (v/n),
## | experience2 (v/c), grades (v/c), location (v/c), region (v/c),
## | country (v/c), group (v/c), gender (v/c), expert (v/c), connect
## | (v/c), Timestamp (e/c), Discussion Title (e/c), Discussion Category
## | (e/c), Parent Category (e/c), Discussion Identifier (e/c), Comment ID
## | (e/c), Discussion ID (e/c), weight (e/n)
## + edges from 368cf94 (vertex names):
## [1] 360->444 356->444 356->444 344->444 392->444 219->444 318->444 4 ->444
## [9] 355->356 355->444 4 ->444 310->444 248->444 150->444 19 ->310 216->19
## [17] 19 ->444 19 ->4 217->310 385->444 217->444 393->444 217->19 256->219
## + ... omitted several edges
We can see that our network is now weighted, as indicated by the “W”, and that our new weight attribute has been added.

Now let’s simplify the network again, this time supplying the edge.attr.comb = argument so that when duplicate edges are collapsed their weights are summed, preserving the number of replies between each pair of actors:

weighted_network <- simplify(network,
                             edge.attr.comb = list(weight="sum")
                             )

weighted_network
Congrats! You made it to the end of data wrangling section and are ready to start analysis! Before proceeding further, knit your document and check to see if you encounter any errors.
We won’t spend a ton of time on formal network statistics in this walkthrough, but let’s calculate a few simple descriptives, starting with the mean and median of our edge weights, i.e. the number of replies between each pair of participants:

mean(edge_weights$n)

## [1] 1.278564

median(edge_weights$n)

## [1] 1

We can also use the hist() function to look at the distribution of edge weights:

hist(edge_weights$n, breaks = 10)
Next, let’s use the degree() function from {igraph} to calculate the degree of each node, i.e. the total number of ties each actor has, and look at its distribution:

node_degree <- degree(weighted_network, mode="all")

hist(node_degree, breaks = 30)

mean(node_degree)

## [1] 8.701124

median(node_degree)

## [1] 4
Since ours is a directed network, we can also restrict our measure to incoming ties, or in-degree, i.e. the number of replies each participant received:

in_degree <- degree(weighted_network, mode="in")

hist(in_degree, breaks = 30)

mean(in_degree)

## [1] 4.350562

median(in_degree)

## [1] 1
Use the code chunk below to try calculating the out-degree of each node (hint: mode = "out") and compare its distribution to the in-degree above.
In the space below, write your interpretation of these results.
-
The plot() function in {igraph} accepts a long list of arguments for customizing a sociogram; for now we’ll stick to igraph arguments only. Let’s start by plotting our weighted network with the default settings:

plot(weighted_network)
The result is the classic “hairball”: far too cluttered to interpret. Let’s start cleaning it up by removing the vertex labels:

plot(weighted_network,
     vertex.label = NA)

That helps, but the vertices are still too large. Let’s shrink them down:

plot(weighted_network,
     vertex.label = NA,
     vertex.size = 1)
Rather than giving every vertex the same size, we can size each vertex by its degree so that more active participants stand out:

plot(weighted_network,
     vertex.label = NA,
     vertex.size = node_degree)

Some nodes have a very high degree, so let’s scale the sizes down:

plot(weighted_network,
     vertex.label = NA,
     vertex.size = node_degree*.1)

The arrowheads on our directed edges are also adding clutter, so let’s shrink those too:

plot(weighted_network,
     vertex.label = NA,
     vertex.size = node_degree*.1,
     edge.arrow.size = .04)
We can also thin out the edges themselves by reducing their width:

plot(weighted_network,
     vertex.label = NA,
     vertex.size = node_degree*.05,
     edge.arrow.size = .04,
     edge.width = .2)

Better yet, instead of a fixed width, we can set the edge width proportional to our edge weights so that more frequent exchanges appear as thicker ties:

plot(weighted_network,
     vertex.label = NA,
     vertex.size = node_degree*.1,
     edge.arrow.size = .04,
     edge.width = E(weighted_network)$weight)

The heaviest edges overwhelm the plot, however, so let’s scale the weights down:

plot(weighted_network,
     vertex.label = NA,
     vertex.size = node_degree*.05,
     edge.arrow.size = .05,
     edge.width = E(weighted_network)$weight/5)
Finally, {igraph} includes a number of layout algorithms that determine where each vertex is placed. Let’s try the Fruchterman-Reingold force-directed layout:

plot(weighted_network,
     vertex.label = NA,
     vertex.size = node_degree*.05,
     edge.arrow.size = .05,
     edge.width = E(weighted_network)$weight/5,
     layout = layout_with_fr)

the Kamada-Kawai layout:

plot(weighted_network,
     vertex.label = NA,
     vertex.size = node_degree*.05,
     edge.arrow.size = .05,
     edge.width = E(weighted_network)$weight/5,
     layout = layout_with_kk)

and a simple circle layout:

plot(weighted_network,
     vertex.label = NA,
     vertex.size = node_degree*.05,
     edge.arrow.size = .05,
     edge.width = E(weighted_network)$weight/5,
     layout = layout_in_circle)
Congrats! You made it to the end of the Explore section and are ready to learn a little about network modeling! Before proceeding further, knit your document and check to see if you encounter any errors.
As highlighted in Chapter 3 of Data Science in Education Using R, the Model step of the data science process entails “using statistical models, from simple to complex, to understand trends and patterns in the data.” The authors note that while descriptive statistics and data visualization during the Explore step can help us to identify patterns and relationships in our data, statistical models can be used to help us determine if relationships, patterns and trends are actually meaningful.
Although we will not explore the use of models for SNA until Unit 3, recall from the Prepare section that the Kellogg et al. study was guided by the following questions:
What are the patterns of peer interaction and the structure of peer networks that emerge over the course of a MOOC-Ed?
To what extent do participant and network attributes (e.g., homophily, reciprocity, transitivity) account for the structure of these networks?
To what extent do these networks result in the co-construction of new knowledge?
To address Question 1, actors in the network were categorized into distinct mutually exclusive groups using the core-periphery and regular equivalence functions of UCINET. The former used the CORR algorithm to divide the network into actors that are part of a densely connected subgroup, or “core”, from those that are part of the sparsely connected periphery. Regular equivalence employs the REGE blockmodeling algorithm to partition, or group, actors in the network based on the similarity of their ties to others with similar ties. In essence, blockmodeling provides a systematic way for categorizing educators based on the ways in which they interacted with peers.
As we saw upon just a basic visual inspection of our network during the Explore section, there was a small core of highly connected participants surrounded by those on the “periphery,” or edge, of the network with very few connections. In the DLT 2 course, those on the periphery made up roughly 90% of the network. The study also found relatively high levels of reciprocation, but roughly a quarter of participants were characterized as “broadcasters”: educators who initiated a discussion thread but neither reciprocated with those who replied nor posted to threads initiated by others.
To address Question 2, the study used the exponential family of random graph models (ERGMs; also known as p* models), which provide a statistical approach to network modeling that addresses the complex dependencies within networks. ERGMs predict network ties and determine the statistical likelihood of a given network structure based on an assumed dependency structure, the attributes of the individuals (e.g., gender, popularity, location, previous ties), and prior states of the network.
Finally, in Learning Analytics Goes to School, the authors describe modeling as simply developing a mathematical summary of a dataset and note that there are two general types of modeling: unsupervised and supervised learning. Clustering algorithms, like the REGE function used to address Question 1, are used to explore the structure of a dataset, while supervised models “help to quantify relationships between features and a known outcome,” similar to the use of actor, relational, and network attributes to predict tie formation among educators in a discussion forum using ERGMs.
Recall from 1a. Review the Research that you were asked to identify two “node attributes” from the dataset that might be useful for predicting participants who may be more engaged or central to the network.
Take a look at page 276 of the article, A social network perspective on peer supported learning in MOOCs for educators. Were your predictions correct?
-
Now that you’ve become a little familiar with this dataset and the social network perspective, what other aspects of this dataset, or a dataset you are interested in exploring, might be modeled to better understand the structure of a network and the relationships between structures or attributes and learning outcomes of interest for those in the network?
Use the space below to write a brief response:
-
The final step in our workflow/process is sharing the results of our analysis with a wider audience. Krumm et al. (2018) have outlined the following three-step process for communicating with education stakeholders what you have learned through analysis:
Select. Communicating what one has learned involves selecting among those analyses that are most important and most useful to an intended audience, as well as selecting a form for displaying that information, such as a graph or table in static or interactive form, i.e. a “data product.”
Polish. After creating initial versions of data products, research teams often spend time refining or polishing them, by adding or editing titles, labels, and notations and by working with colors and shapes to highlight key points.
Narrate. Writing a narrative to accompany the data products involves, at a minimum, pairing a data product with its related research question, describing how best to interpret the data product, and explaining the ways in which the data product helps answer the research question.
In Week 4 of Unit 1, we’ll take a look at selecting, polishing, and narrating a data product to communicate key findings from our analysis.